So far in this series we have covered:
- Git/GitHub for versioning and sharing our code
- renv for reproducing our code’s dependencies
- targets for running our project as a pipeline

What do we need in order to put these pieces together for “production”?
I highly recommend bookmarking this book as a reference; much of the material in the coming sections aligns with its lessons:
Data science alone is pretty useless.
[What matters] is whether your work is useful. That is, whether it affects decisions at your organization or in the broader world.
That means you must share your work by putting it in production.
How do you currently share your work?
(reminder to self: this isn’t a rhetorical question. put answers/typical patterns on the board)
What does it mean to “put something into production”?
Many data scientists think of in production as an exotic state where supercomputers run state-of-the-art machine learning models over dozens of shards of data, terabytes each. There’s a misty mountaintop in the background, and there’s no Google Sheet, CSV file, or half-baked database query in sight.
But that’s a myth. If you’re a data scientist putting your work in front of someone else’s eyes, you are in production.
In my experience as a consultant, I have seen:
I could go on.
I mean, I’ve “put things into production” in ways that are, in retrospect, quite funny.
I ran these reports every week and shared them with other people (read: r/cfb) by directly committing html files to a GitHub repository, which then built and deployed them on GitHub Pages.
This meant I was version controlling ~130 pretty beefy html files weekly.
GitHub Pages was really not intended for that.
My cfb repository is now like 11GB due to storing all of those versions.
I still haven’t really figured out what to do with that, and have instead punted to a new repository.
The better way to “deploy” a bunch of html pages, by the way, is to just render them to a cloud storage bucket and grant public access to that bucket.
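As a sketch of that approach, assuming the {aws.s3} package, AWS credentials set as environment variables, and a hypothetical bucket name:

```r
# Sketch: upload rendered HTML files to a public S3 bucket.
# "my-cfb-reports" is a hypothetical bucket name; credentials are
# assumed to be available as environment variables.
library(aws.s3)

html_files <- list.files("reports", pattern = "\\.html$", full.names = TRUE)

for (f in html_files) {
  put_object(
    file   = f,
    object = basename(f),
    bucket = "my-cfb-reports",
    acl    = "public-read"  # grant public read access to each page
  )
}
```

Versions then live in the bucket (or nowhere, if you overwrite), instead of bloating a Git repository.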
Is this the most sophisticated and mature way to put the results of this project into production?
Nonetheless, this is a result that I’m putting in front of other people; ergo, it’s in production.
For some organizations, in production means a report that gets rendered and emailed around. For others, it means hosting a live app or dashboard that people visit. For the most sophisticated, it means serving live predictions to another service from a machine learning model via an application programming interface (API).
Regardless of the maturity or the form, every organization wants to know that the work is reliable, that the environment is safe, and that the product will be available when people need it.
So, how do we do this? This is where the philosophy of DevOps comes into play.
Consider what we have covered so far in these workshops.
We’ve discussed how to version our code and share it in an external repository so that it can be accessed, run, and edited by others.
We’ve discussed how to create reproducible environments with renv so that other people can restore the exact requirements needed to run our code.
We’ve discussed how to create pipelines with targets so that others can easily re-run our project and produce the same output that we did.
We’ve discussed how to use targets to train competing models and produce finalized models.
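As a refresher, such a pipeline is declared entirely in code; a minimal sketch of a `_targets.R` file (file paths and target names here are hypothetical):

```r
# _targets.R — a minimal pipeline sketch
library(targets)

# packages each target can use
tar_option_set(packages = c("dplyr", "readr"))

list(
  tar_target(raw,   readr::read_csv("data/raw.csv")),       # ingest
  tar_target(clean, dplyr::filter(raw, !is.na(value))),     # wrangle
  tar_target(model, lm(value ~ group, data = clean))        # model
)
```

Running `targets::tar_make()` rebuilds only the targets whose upstream dependencies have changed.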
DevOps principles aim to create software that builds security, stability, and scalability into the software from the very beginning. The idea is to avoid building software that works locally, but doesn’t work well in collaboration or production.
So much of DevOps boils down to preventing the well-it-runs-on-my-machine problem.
The code you’re writing relies on the environment in which it runs. While most data scientists have ways to share code, sharing environments isn’t always standard practice, but it should be.
We can take lessons from DevOps, where the solution is to create explicit linkages between the code and the environment so you can share both.
How close are we to creating fully reproducible environments via code? What are we missing?
We’ve only really covered one layer:
renv and venv allow us to create isolated virtual environments in which to execute our code.
Your data science environment is the stack of software and hardware below your code, from the R and Python packages you’re using right down to the physical hardware your code runs on.
Packages are just one piece; we want to be able to make the entire environment reproducible.
This means we need to be comfortable with creating and using environments via code; this is the crux of DevOps that we need to apply to our data science practice.
The DevOps term for this is that environments are stateless, often captured in the phrase that environments should be “cattle, not pets.” That means you can use standardized tooling to create and destroy functionally identical copies of the environment without secret state being left behind.
We’ve covered creating and taking down one layer:
renv and venv allow us to create isolated virtual environments in which to execute our code.
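In renv terms, creating and tearing down that layer looks like this; deleting the project’s `renv/` folder (or the project itself) destroys the environment without touching anything system-wide:

```r
# The renv lifecycle, run from the R console inside a project:
renv::init()      # create an isolated, project-local package library
renv::snapshot()  # record exact package versions in renv.lock
renv::restore()   # (on another machine) reinstall those exact versions
```

Because the environment is defined by `renv.lock`, any copy restored from the same lockfile is functionally identical.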
Consider everything needed to run the work we’ve covered so far: R/RStudio, Quarto, Git, and all of the underlying libraries used in the background when you install a package from source and pray that the installation succeeds. Not to mention API keys, database credentials, ODBC drivers…

There are three main layers to think about:

- packages: the R and Python packages you use directly (dplyr, pandas)
- system: R, Python, Quarto, Git, and the system libraries they depend on (Fortran, C/C++)
- hardware: the physical or virtual hardware on which your code runs
Your code has to actually run on something. Even if it’s in the cloud it’s still running on a physical machine somewhere.
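We haven’t covered it in this series, but the standard DevOps tool for declaring the system and package layers as code is a container image. A hypothetical sketch for a project like ours might look like:

```dockerfile
# Hypothetical sketch: system + package layers declared as code.
# rocker/r-ver pins a specific R version on a known Linux base.
FROM rocker/r-ver:4.4.1

# system layer: Git plus the C libraries our packages link against
RUN apt-get update -y && \
    apt-get install -y git libgit2-dev libglpk40

# package layer: restore the exact versions recorded in renv.lock
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"
```

Anyone (or any machine) that builds this image gets a functionally identical environment, which is exactly the “cattle, not pets” property.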
So, putting things in production in a safe and reliable way starts with recognizing the different pieces we need to recreate our data science environment.
Then, it becomes a matter of reproducing each of these pieces via code. This part sounds super complicated, and it can be, but a lot of smart people have put a lot of time into making it easier.
Let’s revisit the GitHub action we saw earlier.
```yaml
name: updating the README

on:
  workflow_dispatch:
  push:
    branches: [ "main", "dev" ]

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: write

    strategy:
      matrix:
        r-version: ['4.4.1']

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2

      - name: Set up R ${{ matrix.r-version }}
        uses: r-lib/actions/setup-r@v2
        with:
          r-version: ${{ matrix.r-version }}
          use-public-rspm: true

      - name: Install additional Linux dependencies
        if: runner.os == 'Linux'
        run: |
          sudo apt-get update -y
          sudo apt-get install -y libgit2-dev libglpk40

      - name: Setup renv and install packages
        uses: r-lib/actions/setup-renv@v2
        with:
          cache-version: 1
        env:
          RENV_CONFIG_REPOS_OVERRIDE: https://packagemanager.rstudio.com/all/latest
          GITHUB_PAT: ${{ secrets.GH_PAT }}

      - name: Render README
        shell: bash
        run: |
          git config --global user.name ${{ github.actor }}
          quarto render README.qmd
          git commit README.md -m 'Re-build README.qmd' || echo "No changes to commit"
          git push origin || echo "No changes to commit"
```
This is essentially just a script that:

- checks out the GitHub repository
- sets up Quarto
- sets up R
- uses renv to install packages based on the renv.lock file in the repository
- renders the Quarto README and commits/pushes it to the repository

Now, to be clear, this is a lot of work to just render a goddamn README.
But we can use the same setup to do more elaborate work, such as running the whole dang pipeline via a GitHub Action.
We’ve been building pipelines with targets. What would we need to run the entire pipeline via a GitHub Action?
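One hedged sketch: keep the workflow above exactly as-is, and swap the final render step for one that executes the pipeline (the step name and schedule here are illustrative):

```yaml
on:
  schedule:
    - cron: "0 6 * * 1"  # hypothetical: every Monday at 06:00 UTC

# ...same setup steps as before, then:

      - name: Run targets pipeline
        shell: bash
        run: |
          Rscript -e 'targets::tar_make()'
```

Because renv restores the same packages and the runner provides the same system layer each time, every scheduled run executes in a functionally identical environment.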
To date, we have covered:

- renv for reproducing our code dependencies
- targets for creating repeatable pipelines
- putting things into production as a matter of managing environments
What is the typical output of a data science project?
- a job: a script that trains a model, updates a dataset, or writes to a database
- an app: created in Shiny, Streamlit, or Dash
- a report: a presentation, book, or article that is rendered from code
- an API: a service that serves live predictions or data to other software